Parallel fuzzy C-means clustering algorithm in Spark
WANG Guilan, ZHOU Guoliang, SA Churila, ZHU Yongli
Journal of Computer Applications    2016, 36 (2): 342-347.   DOI: 10.11772/j.issn.1001-9081.2016.02.0342
Abstract
As data volumes grow and timeliness requirements tighten, clustering algorithms must scale to big data and deliver higher performance. A new algorithm, Spark Fuzzy C-Means (Spark-FCM), was proposed based on the Spark distributed in-memory computing platform. First, the matrix was partitioned horizontally into a set of vectors and stored in a distributed manner, so that different vectors resided on different nodes. Then, based on the characteristics of the FCM algorithm, the matrix operations (multiplication, addition and transpose) were redesigned with distributed storage and cache sensitivity in mind. Finally, the Spark-FCM algorithm was implemented by combining these matrix operations with the Spark platform. The primary data structures of the algorithm use distributed matrix storage, which reduces data movement between nodes and lets every step be computed in a distributed fashion. Test results in stand-alone and cluster environments show that Spark-FCM scales well and adapts to large-scale data sets, that performance is linearly related to data size, and that performance in the cluster environment is 2 to 3 times higher than in the stand-alone environment.
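The abstract does not give the implementation, so the following is only a minimal sketch of the general idea it describes: standard fuzzy C-means run over horizontally partitioned row vectors. It is not the authors' Spark-FCM code. Plain Python lists stand in for distributed partitions, and the names `memberships`, `partial_sums`, `merge` and `fcm` are hypothetical. Each partition computes its own partial sums (a "map" step), and only those small per-cluster sums are combined (a "reduce" step), which is one way to keep data movement between nodes low.

```python
import math
from functools import reduce


def memberships(x, centers, m=2.0):
    """Standard FCM membership of one row vector x to every center."""
    d = [math.dist(x, c) for c in centers]
    # If x coincides with one or more centers, share full membership among them.
    if any(di == 0.0 for di in d):
        hits = [1.0 if di == 0.0 else 0.0 for di in d]
        total = sum(hits)
        return [h / total for h in hits]
    p = 2.0 / (m - 1.0)
    return [1.0 / sum((dj / dk) ** p for dk in d) for dj in d]


def partial_sums(partition, centers, m=2.0):
    """'Map' step: weighted sums computed locally on one partition of rows."""
    k, dim = len(centers), len(centers[0])
    num = [[0.0] * dim for _ in range(k)]  # sum of u^m * x per cluster
    den = [0.0] * k                        # sum of u^m per cluster
    for x in partition:
        u = memberships(x, centers, m)
        for j in range(k):
            w = u[j] ** m
            den[j] += w
            for t in range(dim):
                num[j][t] += w * x[t]
    return num, den


def merge(a, b):
    """'Reduce' step: add two partial results elementwise."""
    num = [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a[0], b[0])]
    den = [p + q for p, q in zip(a[1], b[1])]
    return num, den


def fcm(partitions, centers, m=2.0, iters=100, tol=1e-9):
    """Iterate center updates until the largest center shift is below tol."""
    for _ in range(iters):
        num, den = reduce(merge, (partial_sums(p, centers, m) for p in partitions))
        # Keep the old center if a cluster received no weight at all.
        new = [
            [v / den[j] for v in num[j]] if den[j] > 0.0 else list(centers[j])
            for j in range(len(centers))
        ]
        shift = max(math.dist(a, b) for a, b in zip(new, centers))
        centers = new
        if shift < tol:
            break
    return centers


# Two partitions of row vectors, standing in for rows held on two nodes.
partitions = [[[0.0, 0.0], [0.1, 0.0]], [[5.0, 5.0], [5.1, 5.0]]]
final_centers = fcm(partitions, [[0.0, 1.0], [4.0, 4.0]])
```

In this sketch only the `k` numerator vectors and `k` denominators cross partition boundaries on each iteration, never the row vectors themselves. In a real Spark job the same pattern would presumably map onto a per-partition operation followed by an aggregation, but that mapping is an assumption and is not taken from the paper.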